for programs that run mostly “behind the scenes” (e.g. scraping)
We’ll use a notebook for convenience, but you might want to install a “full-grown” Python environment later on. (Not today’s topic.)
Hello World!
print("Hello World!")
Hello World!
# whatever succeeds a hash sign (#) is not evaluated.
# useful for comments
# or to (temporarily) disable code
# print("Hello World!")  # skipped
print("Hello Earthling!")  # evaluated
Hello Earthling!
Calculation
4+4
8
4-4
0
3*3
9
10/2
5.0
2**10
1024
5%2
1
etc.
Variables
eggs = 4
4 * eggs
16
dog_name = "Barkley"
cat_name = "Cheeto"
print("My dog is", dog_name, "and my cat is", cat_name)
My dog is Barkley and my cat is Cheeto
Try a simple calculation, using a variable.
Just enter some code in the first cell of your notebook and hit Ctrl+Enter on your keyboard to evaluate it.
Data Types
There are more, but these are often used.
"The Answer"# this is a String"""The Answer to theUltimate Question ofLife, the Universe,and Everything"""# this is also a String42# this is an Integer number1298.423# this is a floating point numberTrueFalse# these are Boolean values["cat", "mouse"] # this is a List{"name": "Barkley", "species": "dog"} # this is a dictionary (Dict)
Slicing
Allows you to access a part of certain data structures.
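For example, with a string or a list:
"mouse"[0:3]
'mou'
"mouse"[-1]
'e'
["cat", "mouse", "dog"][0]
'cat'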
Logical operators are useful for writing conditions:
1 == 1
True
1 != 1
False
10 < 5
False
1 == 1 and ("m" in "mouse" or "x" in "dog")
True
Objects
Everything in Python is an object. Depending on their class,
objects have certain attributes, but not others
you can do certain things with them, but not others.
For example, a cup of tea is an object.
It has a color, but not a favorite movie.
You can drink it, but not read it.
You can access the attributes and methods of a Python object using the dot (.) operator. Think of it as right-clicking on a file on your computer and selecting a command from the context menu.
For example, strings have a .capitalize() method that capitalizes the first letter.
"capital".capitalize()
'Capital'
Or a .count() method that counts substring occurrences.
"acetylcholinesterase".count("e")
4
Modules
Modules are ready-made Lego™ bricks you can use to build something new.
There are currently 417,030 modules on the main repository pypi.org.
import requests
from bs4 import BeautifulSoup
Here we import the requests module, then the BeautifulSoup function from the bs4 module (one particular brick from a Lego™ pack).
How to Do It Part 2: A Tiny Scraping Project
There are many Python modules you can use to scrape websites. A popular selection:
Module          Focus
Scrapy          complex web scraping/crawling framework
Selenium        remote control for a web browser, used when content is added by JavaScript
BeautifulSoup   simple but powerful parsing library for html content
R also has web scraping modules. (Not today’s topic, but see Web Scraping with R.)
The Eur-Lex Database
Eur-Lex is the main legal database of the European Union. It includes, for example
the Official Journal
legislative and preparatory documents
rulings of the EU courts
procedural information on legislation
Data can be accessed through various search interfaces. Some parts of the data are standardized and can be exported for download, obviating the need for web-scraping. An R package (by Michal Ovádek) for accessing standardized Eur-Lex data is available.
Eur-Lex’ terms of service are permissive. While they don’t explicitly mention web-scraping, they don’t exclude it either.
Proceed to “Advanced Search” (below the big search bar)
Select the “Case-law” collection
Select “Judgment” in “Document reference”
Select “Court of Justice” in “Author of the document”
Hit “Search”
Voilà! These are all judgments ever issued by the European Court of Justice.
But that’s a lot.
Let’s try some scraping.
The Quest
Imagine we’re interested in the European Court of Justice (ECJ). We want a list of all the litigants that were ever involved in an ECJ proceeding.
There is no standardized litigant information on Eur-Lex that we could download using the export function. But litigants are mentioned in the text of each court ruling. This is a good use-case for web-scraping.
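First, we store the URL of the search results page in a variable. (This is the long URL generated by the advanced search; the same URL reappears in the compact version at the end.)
query_url = "https://eur-lex.europa.eu/search.html?SUBDOM_INIT=EU_CASE_LAW&DB_TYPE_OF_ACT=judgment&DTS_SUBDOM=EU_CASE_LAW&typeOfCourtStatus=COURT_JUSTICE&DTS_DOM=EU_LAW&typeOfActStatus=JUDGMENT&page=1&lang=en&CASE_LAW_SUMMARY=false&type=advanced&DB_TYPE_COURT=COURT_JUSTICE"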
Next, we pass the url to requests’ .get method, immediately assigning the output to a variable (r).
r = requests.get(query_url)
r now contains the html of the search results page and some other information. We can access the html part using the .text attribute. For the first 250 letters:
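print(r.text[0:250])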
traverse the branches of the html code until we arrive at useful information
or search the entire html tree for the information we need
and eventually extract it
“What html tree,” you ask?
A Website Under the Hood
<html>
  <head>
    <title>A Web Page</title>
  </head>
  <body>
    <p id="author">Henning Deters</p>
    <p id="subject">A Web-Scraping Primer with Python</p>
    <a href="https://en.wikipedia.org/">A link to Wikipedia</a>
  </body>
</html>
The Eur-Lex website uses the same html syntax. It’s just a “bit” more complicated – and messy!
Text is embedded in “tags”: <p> … </p> for paragraphs, <a> … </a> for links etc.
Some tags carry additional attributes: id=..., href=...
Nesting: <p> sits within <body>, which is in <html>
Inspecting the html code on Eur-Lex
Please go to the search results, then hit F12 (Firefox) or Shift-Ctrl-J (Chrome) to open the developer tools.
click on the “element picker”
select the first search result
the tool shows you the html snippet that corresponds to the search result
It can be useful to examine the full html source. In Firefox, right-click anywhere and select “View Page Source”.
The snippet we are looking for is located
in an a tag (hyperlink),
nested within an h2 tag (heading level 2),
within a div tag with the class attribute “SearchResult”,
within even more levels …
The URL of the result appears twice. We want the second appearance, assigned to the name attribute.
Extracting the Information
We create a soup object by feeding the stuff we downloaded to BeautifulSoup. The second argument ("html5lib") tells BeautifulSoup which parser to use for the html.
html = r.text
soup = BeautifulSoup(html, "html5lib")
The soup object comes with useful methods. .find_all finds all occurrences of a search term. Remember: We’re looking for something that’s nested within a div tag with the class “SearchResult”.
soup.find_all("div", {"class": "SearchResult"})
.find_all expects the first argument to be a string with the tag, and the optional second argument to be a dictionary.
.find_all returns an object similar to a list. Thus we can access the first result by attaching an index.
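search_results[0]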
<div class="SearchResult" xmlns="http://www.w3.org/1999/xhtml"><h2><a class="title" href="./legal-content/AUTO/?uri=CELEX:62020TJ0279&qid=1669282930007&rid=1" id="cellar_96f68f89-6b2b-11ed-9887-01aa75ed71a1" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279"> Judgment of the General Court ... Manifest errors of assessment.<br/>Case T-279/20.</a></h2>
As discovered earlier, the URL of the result sits within <a> tags, which are nested within <h2> tags.
Since there is only one URL per result, we can ignore the <h2> tags and look just for the <a>:
search_results[0].find("a")
<a class="title" href="./legal-content/AUTO/?uri=CELEX:62020TJ0279&qid=1669282930007&rid=1" id="cellar_96f68f89-6b2b-11ed-9887-01aa75ed71a1" name="https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279"> Judgment of the General Court ... Manifest errors of assessment.<br/>Case T-279/20.</a>
The URL after href is what your browser opens when you click on a link, but it’s truncated. Since we’re too lazy to add the missing part, we use the URL after the name attribute.
Attributes can be accessed like a dictionary. We pass "name" as key and are served the corresponding value (the URL).
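search_results[0].find("a")["name"]
'https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020TJ0279'
To collect the URLs of all judgments, we repeat this for every search result (mirroring the compact version at the end):
urls = [x.find("a")["name"] for x in search_results]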
Now that we have the URLs for all judgments, we can download each one. This might take a minute or two.
We’re just recycling code from earlier and placing it in a loop. Not terribly elegant, but fairly straightforward by now.
contents = []
for u in urls:
    print("Scraping", u)
    r = requests.get(u)
    soup = BeautifulSoup(r.text, "html5lib")
    contents.append(soup)
Scraping https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62020CJ0638
Scraping https://eur-lex.europa.eu/legal-content/AUTO/?uri=CELEX:62021CJ0658
etc.
Inspecting the html of a Judgment
Please open one of the search results (for example this one) and inspect the text of the first judgment, using the developer tools.
The snippet we are looking for is located within a div tag with the id attribute “text”. It corresponds to the box that holds the text of the ruling.
Extracting the Information
Now we work with the html of all judgments, which we stored in the variable contents. We narrow down the search to the main text, delimited by <div id="text">.
# narrow down to the text of the ruling
for c in contents:
    text = c.find("div", {"id": "text"})
    print(type(text))
The type() function returns the type of BeautifulSoup's output. Some results are of NoneType, which means these particular pages did not contain the div we were looking for.
Let's open the corresponding URL in our list of search results to look at the offending page.
The ruling has not (yet) been translated into English, therefore the page does not have its text on it!
We’ll keep this in mind for the methodology section, and just ignore all pages that lack the text of the ruling.
# narrow down to the text of the ruling
ruling_texts = []
for c in contents:
    text = c.find("div", {"id": "text"})
    if text is not None:
        ruling_texts.append(text)
The if statement means we include only those results in our ruling_texts list that are not None.
Now we have a list containing the text of all the rulings. But we’re really only interested in the litigants.
Inspecting one of the search results once more, we find that the litigants are located
within <b> tags (for bold print), which are
within <p> tags with the class “C02AlineaAltA”
This is not always true, but for simplicity we pretend it is.
The .find_all_next() method of BeautifulSoup finds all elements matching a search criterion that come after the current element.
# narrow down to relevant paragraphs
paragraphs = []
for r in ruling_texts:
    relevant_p = r.find_all_next("p", {"class": "C02AlineaAltA"})
    paragraphs.extend(relevant_p)
Here we tell Python to give us a list of all <p> tags with the class "C02AlineaAltA" after the <div> that marks the text of the ruling on the web page. We repeat this for all ruling texts, using a for loop.
We further narrow down the relevant paragraphs to bold text (i.e. in <b> tags).
# narrow down to bold parts
bold = []
for p in paragraphs:
    x = p.find("b")
    if x is not None:
        bold.append(x.text)
bold[0:5]
['MCM',
'Centrala studiestödsnämnden,',
'Belgisch-Luxemburgse vereniging van de industrie van plantenbescherming VZW (Belplant),',
'Vlaams Gewest,',
'Cafpi SA,']
Line 4: Again, the .find() method returns None if the current soup object does not contain a <b> tag.
Line 5: We use if to ensure that those don’t end up on our list of bold paragraphs.
Line 6: The .text attribute returns just the text between the tags, omitting the <b> ... </b>.
Final Polish
.strip() gets rid of spaces left and right
.rstrip(",") removes the commas on the right
We can chain one after the other.
# stripping commas and spaces
litigants = [x.strip().rstrip(",") for x in bold]
litigants[0:3]
['MCM',
'Centrala studiestödsnämnden',
'Belgisch-Luxemburgse vereniging van de industrie van plantenbescherming VZW (Belplant)']
This is a “list comprehension”, often an elegant alternative to a for loop.
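For comparison, the equivalent for loop would be:
litigants = []
for x in bold:
    litigants.append(x.strip().rstrip(","))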
Could You Be Less Verbose, Please?
We could have done all of this in a few lines.
import requests
from bs4 import BeautifulSoup

query_url = "https://eur-lex.europa.eu/search.html?SUBDOM_INIT=EU_CASE_LAW&DB_TYPE_OF_ACT=judgment&DTS_SUBDOM=EU_CASE_LAW&typeOfCourtStatus=COURT_JUSTICE&DTS_DOM=EU_LAW&typeOfActStatus=JUDGMENT&page=1&lang=en&CASE_LAW_SUMMARY=false&type=advanced&DB_TYPE_COURT=COURT_JUSTICE"
soup = BeautifulSoup(requests.get(query_url).text, "html5lib")
search_results = soup.find_all("div", {"class": "SearchResult"})
search_results = [x.find("a", {"class": "title"}) for x in search_results]
urls = [x["name"] for x in search_results]
contents = [BeautifulSoup(requests.get(u).text, "html5lib") for u in urls]
ruling_texts = [x.find("div", {"id": "text"}) for x in contents]
paragraphs = [x.find_all_next("p", {"class": "C02AlineaAltA"}) for x in ruling_texts if x is not None]
bold = [x.find("b") for y in paragraphs for x in y]
litigants = [x.text.strip().rstrip(",") for x in bold if x is not None]
Very compact, but much harder to understand. When writing code, make sure you’ll be able to go back to it months later.
Caveats and Extensions
The script only downloads the first ten search results. The URL for the subsequent ten results ends in &page=2. By now you can probably guess how to scrape all results. (Hint: it involves a loop).
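A minimal sketch (assuming, for illustration, that we only want the first five result pages):
all_results = []
for page in range(1, 6):
    # swap the page number into the query URL
    page_url = query_url.replace("&page=1&", f"&page={page}&")
    page_soup = BeautifulSoup(requests.get(page_url).text, "html5lib")
    all_results.extend(page_soup.find_all("div", {"class": "SearchResult"}))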
The script downloads the rulings and extracts information from them in one go. It's much more efficient (and puts less burden on the server) to first download all results and then extract the information from local files. This way you'll only ever have to download the results once. (Writing to and reading from files involves the open() command.)
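A minimal sketch of saving one downloaded page and reading it back (the file name is just an example):
# save the downloaded html to a local file ...
with open("ruling_0.html", "w", encoding="utf-8") as f:
    f.write(r.text)

# ... later, parse the local copy without hitting the server again
with open("ruling_0.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html5lib")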
Eur-Lex is not very consistent in how it encodes information in html. For example, litigants are not always printed in boldface or enclosed in <p> tags with a certain class. Sometimes you have to search for text patterns instead. (This involves “regular expressions”, available in the re module).
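For instance, a toy illustration with the re module (the pattern is made up for demonstration and is not the actual Eur-Lex format):
import re

# hypothetical pattern: extract a case number such as "T-279/20" from a sentence
m = re.search(r"Case\s+([A-Z]-\d+/\d+)", "Judgment of the General Court ... Case T-279/20.")
if m:
    print(m.group(1))  # prints: T-279/20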
Resources
Gentle Python Introductions
Freeman, Eric. Head First Learn to Code. Beijing: O’Reilly, 2018.
Sweigart, Al. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. 2nd edition. San Francisco: No Starch Press, 2020, full text.
Atteveldt, Wouter van, Damian Trilling, and Carlos Arcíla. Computational Analysis of Communication: A Practical Introduction to the Analysis of Texts, Networks, and Images with Code Examples in Python and R. Hoboken, NJ: John Wiley & Sons, 2021, full text.
McLevey, John. Doing Computational Social Science: A Practical Introduction. Thousand Oaks: SAGE Publications, 2021.
Image Sources
Javier Allegue Barros https://unsplash.com/photos/0nOP5iHVaZ8